ARRANGEMENT AND METHOD FOR RECOGNIZING A 
PREDETERMINED VOCABULARY IN SPOKEN LANGUAGE WITH A 
COMPUTER 

The invention is directed to an arrangement and a method for the 
recognition of a predetermined vocabulary in spoken language by a computer. 

A method and an arrangement for the recognition of spoken language are
known from [1]. In particular, a signal analysis and a global search that accesses an
acoustic model and a linguistic model of the language to be recognized are implemented
in the recognition of spoken language until a recognized word sequence is obtained
from a digitalized voice signal. The acoustic model is based on a phoneme inventory
realized with the assistance of hidden Markov models (HMMs). With the assistance
of the acoustic model, a suitable probable word sequence is determined during the
global search for feature vectors that proceeded from the signal analysis, and this is
output as the recognized word sequence. The words to be recognized are stored in a
pronunciation lexicon together with a phonetic transcription. The relationship is
explained in depth in [1].

For explaining the subsequent comments, the terms that are employed 
shall be briefly discussed here. 

As a phase of the computer-based speech recognition, the signal analysis
particularly comprises a Fourier transformation of the digitalized voice signal and a
feature extraction following thereupon. The signal analysis ensues every ten
milliseconds. From overlapping time segments with a respective
duration of, for example, 25 milliseconds, approximately 30 features are determined
on the basis of the signal analysis and combined to form a feature vector. The
components of the feature vector describe the spectral energy distribution of the
appertaining signal excerpt. In order to arrive at this energy distribution, a Fourier
transformation is implemented on every signal excerpt (25 ms time excerpt). The
components of the feature vector result from the presentation of the signal in the
frequency domain. After the signal analysis, thus, the digitalized voice signal is
present in the form of feature vectors.
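
The signal analysis described above can be sketched as follows. This is only an illustration under stated assumptions: the sampling rate of 16 kHz, the Hamming window, the equal-width frequency bands and the logarithm are not taken from the specification, and the function name `feature_vectors` is hypothetical. Only the 25 ms excerpt length, the 10 ms step and the roughly 30 features per vector come from the text.

```python
import numpy as np

def feature_vectors(signal, sample_rate=16000, frame_ms=25, step_ms=10, n_features=30):
    """Return one feature vector per 10 ms step from overlapping 25 ms excerpts."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per 25 ms excerpt
    step = int(sample_rate * step_ms / 1000)         # samples per 10 ms advance
    window = np.hamming(frame_len)                   # assumed taper, not from the text
    vectors = []
    for start in range(0, len(signal) - frame_len + 1, step):
        excerpt = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(excerpt)) ** 2    # spectral energy per frequency bin
        bands = np.array_split(power, n_features)    # assumed: equal-width bands
        energies = np.array([band.sum() for band in bands])
        vectors.append(np.log(energies + 1e-10))     # log energy, guarded against log(0)
    return np.array(vectors)

# One second of a 440 Hz tone as a stand-in for a digitalized voice signal.
t = np.arange(16000) / 16000.0
features = feature_vectors(np.sin(2 * np.pi * 440.0 * t))
print(features.shape)
```

Each row of `features` is one feature vector describing the spectral energy distribution of one signal excerpt.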



These feature vectors are supplied to the global search, a further phase of
the speech recognition. As already mentioned, the global search makes use of the
acoustic model and, potentially, of the linguistic model in order to image the sequence
of feature vectors onto individual parts of the language (vocabulary) present as model.
A language is composed of a given plurality of sounds, what are referred to as
phonemes, whose totality is referred to as phoneme inventory. The vocabulary is
modelled by phoneme sequences and stored in a pronunciation lexicon. Each
phoneme is modelled by at least one HMM. A plurality of HMMs yield a stochastic
automaton that comprises statuses and status transitions. The time execution of the
occurrence of specific feature vectors (even within a phoneme) can be modelled with
HMMs. A corresponding phoneme model thereby comprises a given plurality of
statuses that are arranged in linear succession. A status of an HMM represents a part
of a phoneme (for example an excerpt of 10 ms length). Each status is linked to an
emission probability for the feature vectors, which, in particular, is distributed
according to Gauss, and to transition probabilities for the possible transitions. The
emission distribution allocates to a feature vector the probability with which that
feature vector is observed in the appertaining status. The possible transitions
are a direct transition from one status into a next status, a repetition of the status and a
skipping of the status.

A joining of the HMM statuses to the appertaining transitions over
time is referred to as a trellis. The principle of dynamic programming is employed in
order to determine the acoustic probability of a word: the path through the trellis is
sought that exhibits the fewest errors or, respectively, that is defined by the highest
probability for a word to be recognized.
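
The search for the most probable path through the trellis can be illustrated with a small dynamic programming sketch. It assumes a single linear model with one Gaussian per status, a path that starts in the first status and ends in the last, and the three transitions named above (repetition, next status, skipping). The function names `log_gauss` and `viterbi` and all numeric values are hypothetical.

```python
import numpy as np

def log_gauss(x, mean, std):
    """Log density of a diagonal-covariance Gaussian emission."""
    z = (x - mean) / std
    return float(-0.5 * np.sum(z * z) - np.sum(np.log(np.sqrt(2.0 * np.pi) * std)))

def viterbi(frames, means, stds, log_trans):
    """Best path through the trellis of a linear HMM.

    log_trans[k] is the log probability of advancing by k statuses:
    k=0 repetition, k=1 next status, k=2 skipping one status.
    Assumes the last status is reachable within the given frames.
    """
    n_frames, n_status = len(frames), len(means)
    score = np.full((n_frames, n_status), -np.inf)
    back = np.zeros((n_frames, n_status), dtype=int)
    score[0, 0] = log_gauss(frames[0], means[0], stds[0])   # start in first status
    for t in range(1, n_frames):
        for s in range(n_status):
            for k, log_p in enumerate(log_trans):
                prev = s - k
                if prev < 0 or score[t - 1, prev] == -np.inf:
                    continue
                candidate = score[t - 1, prev] + log_p
                if candidate > score[t, s]:
                    score[t, s] = candidate
                    back[t, s] = prev
            if score[t, s] > -np.inf:
                score[t, s] += log_gauss(frames[t], means[s], stds[s])
    path = [n_status - 1]                                   # end in last status
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t, path[-1]])
    return score[-1, -1], [int(s) for s in reversed(path)]

means = np.array([[0.0], [5.0], [10.0]])
stds = np.ones((3, 1))
frames = np.array([[0.0], [0.0], [5.0], [5.0], [10.0]])
log_prob, path = viterbi(frames, means, stds, np.log([0.5, 0.4, 0.1]))
print(path)
```

The returned `path` lists, per feature vector, the status onto which it is imaged; `log_prob` is the log probability of that path.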

The result of the global search is the output or, respectively, offering of a 
recognized word sequence that derives taking the acoustic model (phoneme 
inventory) for each individual word and the language model for the sequence of words 
into consideration. 

[2] discloses a method for speaker adaptation based on a MAP estimate
(MAP = maximum a posteriori) of HMM parameters.
According to [2], thus, it is recognized that a speaker-dependent system for speech
recognition normally supplies better results than a speaker-independent system,
insofar as adequate training data are available that enable a modelling of the
speaker-dependent system. However, the speaker-independent system achieves the better
results as soon as the set of speaker-specific training data is limited. One possibility
for performance enhancement of both systems, i.e. of both the speaker-dependent as
well as the speaker-independent system for speech recognition, is comprised in
employing previously stored datasets of a plurality of speakers such that a small set of
training data also suffices for modelling a new speaker with adequate quality. Such a
training method is called speaker adaptation. In [2], the speaker adaptation is
particularly implemented by a MAP estimate of the hidden Markov model parameters.

Results of a method for recognizing spoken language generally deteriorate
as soon as characteristic features of the spoken language deviate from characteristic
features of the training data. Examples of characteristic features are speaker qualities
or acoustic features that influence the articulation of the phonemes in the form of
slurring.

The approach embarked upon in [2] for speaker adaptation is comprised in
"post-estimating" parameter values of the hidden Markov models, whereby this
processing is implemented "offline", i.e. not at the run time of the method for speech
recognition.

J. Takami et al., "Successive State Splitting Algorithm for Efficient
Allophone Modeling", ICASSP 1992, March 1992, pages 573 through 576, San
Francisco, USA, discloses a method for recognizing a predetermined vocabulary in
spoken language wherein states are split in a hidden Markov model. The probability
density function of the respective states is also split therefor.

The object of the invention is comprised in specifying an arrangement and
a method for recognizing a predetermined vocabulary in spoken language, whereby,
in particular, an adaptation of the acoustic model is accomplished at the run time (i.e.,
"online").

This object is achieved in accordance with the features of the independent patent
claims.

For achieving the object, a method for recognizing a predetermined
vocabulary in spoken language with a computer is recited wherein a voice signal is
determined from the spoken language. The voice signal is subjected to a signal
analysis from which feature vectors for describing the digitalized voice signal
proceed. A global search is implemented for imaging the feature vectors onto a
language present in modelled form, whereby each phoneme of the language is
described by a modified hidden Markov model and each status of the modified hidden
Markov model is described by a probability density function. An adaptation of the
probability density function ensues such that it is split into a first probability density
function and into a second probability density function. Finally, the global search
offers a recognized word sequence.

Let it thereby be noted that the probability density function that is split
into a first and into a second probability density function can represent an emission
distribution for a predetermined status of the modified hidden Markov model,
whereby this emission distribution can also contain a superimposition of a plurality of
probability density functions, for example Gauss curves (Gaussian probability density
distributions).

A recognized word sequence can thereby also comprise individual sounds
or, respectively, only a single word.

If, in the framework of the global search, a recognition is effected with a
high value for the distance between spoken language and appertaining word sequence
determined by the global search, then the allocation of a zero word can ensue, said
zero word indicating that the spoken language is not being recognized with adequate
quality.

By splitting the probability density function, one advantage of the
invention is to create new regions in a feature space erected by the feature vectors,
these new regions comprising significant information with reference to the digitalized
voice data to be recognized and, thus, assuring an improved recognition.

One development is comprised therein that the probability density
function is split into the first and into the second probability density function when
the drop off of an entropy value lies below a predetermined threshold.

The splitting of the probability density function dependent on an entropy 
value proves extremely advantageous in practice. 

The entropy is generally a measure of an uncertainty in a prediction of a
statistical event. In particular, the entropy can be mathematically defined for
Gaussian distributions, whereby there is a direct logarithmic dependency between the
scatter σ and the entropy.

Another development of the invention is comprised therein that the
probability density functions, particularly the first and the second probability density
function, respectively comprise at least one Gaussian distribution.

The probability density function of the status is approximated by a sum of
a plurality of Gaussian distributions. The individual Gaussian distributions are called
modes. In the recited method, in particular, the modes are considered isolated from
one another. One mode is divided into two modes in every individual split event.
When the probability density function was formed of M modes, then it is formed of
M+1 modes after the split event. When, for example, a mode is assumed to be a
Gaussian distribution, then an entropy can be calculated, as shown in the exemplary
embodiment.

An online adaptation is advantageous because the method continues to
recognize speech without having to be set to the modification of the vocabulary in a
separate training phase. A self-adaptation ensues that, in particular, becomes
necessary due to a modified co-articulation of the speakers due to an addition of a new
word.

The online adaptation, accordingly, requires no separate calculation of the
probability density functions that would in turn be responsible for a non-availability
of the system for speech recognition.

One development of the invention is comprised therein that identical
standard deviations are defined for the first probability density function and for the
second probability density function. A first average of the first probability density
function and a second average of the second probability density function are defined
such that the first average differs from the second average.

This is an example for the weighting of the first and second probability
density function split from the probability density function. Any other
weightings are also conceivable that are to be adapted to the respective application.
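
A split event of this kind can be sketched for a probability density function stored as arrays of mode weights, averages and standard deviations. The storage layout, the equal halving of the weight and the name `split_mode` are assumptions made for illustration; the text itself leaves other weightings open.

```python
import numpy as np

def split_mode(weights, means, stds, index, split_std, mean_1, mean_2):
    """Replace mode `index` by two modes with identical deviations and different averages."""
    half = weights[index] / 2.0                              # assumed: equal weighting
    new_weights = np.append(np.delete(weights, index), [half, half])
    new_means = np.vstack([np.delete(means, index, axis=0), mean_1, mean_2])
    new_stds = np.vstack([np.delete(stds, index, axis=0), split_std, split_std])
    return new_weights, new_means, new_stds

# Two modes in a two-dimensional feature space; the first one is split.
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0], [3.0, 3.0]])
stds = np.ones((2, 2))
new_weights, new_means, new_stds = split_mode(
    weights, means, stds, 0,
    split_std=np.array([0.5, 1.0]),
    mean_1=np.array([-1.0, 0.0]),
    mean_2=np.array([1.0, 0.0]),
)
print(len(new_weights))
```

A density formed of M modes before the call is formed of M+1 modes afterwards, and the weights still sum to one.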

A further development is that the method is implemented multiple times in
succession and, thus, a repeated splitting of the probability density function ensues.



Developments of the invention derive from the dependent claims.

Another solution of the object is comprised in specifying an arrangement
with a processor unit that is configured such that the following steps can be
implemented:

a) a digitalized voice signal is determined from the spoken language;

b) a signal analysis ensues on the digitalized voice signal, feature
vectors for describing the digitalized voice signal proceeding
therefrom;

c) a global search ensues for imaging the feature vectors onto a
language present in modelled form, whereby phonemes of the
language can be described by a modified hidden Markov model
and each status of the hidden Markov model can be described by a
probability density function;

d) the probability density function is adapted by
modification of the vocabulary in that the probability density
function is split into a first probability density function and into a
second probability density function;

e) the global search offers a recognized word sequence.

This arrangement is especially suited for the implementation of the
inventive method or of one of its developments explained above.

Exemplary embodiments of the invention are presented and explained
below with reference to the drawings.

Figure 1 shows an arrangement or, respectively, a method for the
recognition of spoken language. The introduction to the specification is referenced
for explaining the terms employed below.

In a signal analysis 102, a digitalized voice signal 101 is subjected to a
Fourier transformation 103 with following feature extraction 104. The feature vectors
105 are communicated to a system for global searching 106. The global search 106
considers both an acoustic model 107 as well as a linguistic model 108 for
determining the recognized word sequence 109. Accordingly, the digitalized voice
signal 101 becomes the recognized word sequence 109.

The phoneme inventory is simulated in the acoustic model 107 on the
basis of hidden Markov models.

A probability density function of a status of the hidden Markov model is
approximated by a summing-up of individual Gaussian modes. A mode is, in
particular, a Gaussian bell. A mixing of individual Gaussian bells and, thus, a
modelling of the emission probability density function arises by summing up a
plurality of modes. A decision is made on the basis of a statistical criterion as to
whether the vocabulary to be recognized by the speech recognition unit can be
modelled better by adding further modes. In the present invention, this is particularly
achieved by incremental splitting of already existing modes when the statistical
criterion is met.
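
The summing-up of Gaussian modes can be illustrated as follows. The sketch assumes diagonal covariance matrices and per-mode weights that sum to one; the weights and the name `emission_density` are assumptions for illustration and are not recited in the text.

```python
import numpy as np

def emission_density(x, weights, means, stds):
    """Weighted sum of diagonal-covariance Gaussian modes evaluated at feature vector x."""
    x = np.asarray(x, dtype=float)
    z = (x - means) / stds                                   # shape (modes, dimensions)
    norm = np.prod(np.sqrt(2.0 * np.pi) * stds, axis=1)      # normalizer of each mode
    return float(np.sum(weights * np.exp(-0.5 * np.sum(z * z, axis=1)) / norm))

# Two modes in a two-dimensional feature space (hypothetical values).
weights = np.array([0.7, 0.3])
means = np.array([[0.0, 0.0], [4.0, 4.0]])
stds = np.array([[1.0, 1.0], [0.5, 0.5]])
density = emission_density([0.5, -0.2], weights, means, stds)
print(density)
```

Adding a further mode amounts to appending one row to `means` and `stds` and one entry to `weights`.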

The entropy is defined by

$$H_p = -\int_{-\infty}^{\infty} p(x)\,\log_2 p(x)\,dx \tag{1}$$

given the assumption that $p(x)$ is a Gaussian distribution with a diagonal covariance
matrix, i.e.

$$p(x) = \mathcal{N}(\mu, \sigma_n) = \frac{1}{\prod_{n=1}^{N} \sqrt{2\pi}\,\sigma_n} \exp\left(-\frac{1}{2} \sum_{n=1}^{N} \left(\frac{x_n - \mu_n}{\sigma_n}\right)^2\right) \tag{2}$$

one obtains

$$H_p = \sum_{n=1}^{N} \log_2\left(\sqrt{2\pi e}\,\sigma_n\right) \tag{3}$$

whereby
$\mu$ references the anticipated value,
$\sigma_n$ references the scatter for each component n, and
N references the dimension of the feature space.
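
As a plausibility check, the closed form of Equation (3) can be compared with a numerical evaluation of Equation (1) for a single component. The scatter value of 2.0, the integration grid and the name `mode_entropy` are arbitrary choices for this sketch.

```python
import numpy as np

def mode_entropy(stds):
    """Entropy in bits of a diagonal-covariance Gaussian mode, Equation (3)."""
    return float(np.sum(np.log2(np.sqrt(2.0 * np.pi * np.e) * np.asarray(stds))))

# Numerical evaluation of Equation (1) for one component with scatter 2.0.
sigma = 2.0
x = np.linspace(-40.0, 40.0, 200001)
p = np.exp(-0.5 * (x / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)
dx = x[1] - x[0]
numeric = float(-np.sum(p * np.log2(p)) * dx)    # Riemann sum of -p log2 p
closed = mode_entropy([sigma])
print(numeric, closed)
```

The two values agree to within the accuracy of the numerical integration, and doubling the scatter of one component raises the entropy by exactly one bit, which is the logarithmic dependency mentioned above.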

The true distribution $p(x)$ is not known. It is, in particular, assumed to be
a Gaussian distribution. In the acoustic model, the probability $p(x)$ is approximated
with

$$\hat{p}(x) = \mathcal{N}(\hat{\mu}, \sigma_n)$$

on the basis of random samples, whereby

$$\hat{\mu} = \frac{1}{L} \sum_{l=1}^{L} x_l$$

represents an average over L observations. The corresponding entropy as function of
$\hat{\mu}$ is established by

$$H_p(\hat{\mu}) = -\int_{-\infty}^{\infty} p(x)\,\log_2 \hat{p}(x)\,dx \tag{4}$$

which ultimately leads to

$$H_p(\hat{\mu}) = H_p + \sum_{n=1}^{N} \frac{(\mu_n - \hat{\mu}_n)^2}{\sigma_n^2}\,\log_2 \sqrt{e} \tag{5}$$

The anticipated value $E\{(\mu_n - \hat{\mu}_n)^2\}$ amounts to $\frac{1}{L}\sigma_n^2$, so that the
anticipated value of $H_p(\hat{\mu})$ is given as

$$\hat{H}_p = E\{H_p(\hat{\mu})\} = H_p + \frac{N}{L}\,\log_2 \sqrt{e} \tag{6}$$

Equation (3) thus derives for the entropy of a mode that is defined with a
Gaussian distribution with a diagonal covariance matrix. The process is now
approximated with an estimate. The entropy of the approximated process derives as

$$\hat{H} = H + \frac{N}{L}\,\log_2 \sqrt{e} \tag{7}$$


The estimate is all the better the higher the number L of random samples
is, and the estimated entropy $\hat{H}$ becomes all the closer to the true entropy H.
Let

$$\hat{p}(x) = \mathcal{N}(\hat{\mu}, \sigma_n) \tag{8}$$

be the mode to be divided. It is also assumed that the two Gaussian distributions that
arise as a result of the division process have identical standard deviations $\sigma^s$ and are
identically weighted. This yields

$$\hat{p}_s(x) = \frac{1}{2}\,\mathcal{N}(\hat{\mu}_1, \sigma^s) + \frac{1}{2}\,\mathcal{N}(\hat{\mu}_2, \sigma^s) \tag{9}$$

Given the assumption that $\mu_1 \approx \hat{\mu}_1$, $\mu_2 \approx \hat{\mu}_2$ and that $\mu_1$ is at a sufficiently
great distance from $\mu_2$, the entropy of the split probability density function
derives as

$$\hat{H}_s = 1 + \sum_{n=1}^{N} \log_2\left(\sqrt{2\pi e}\,\sigma_n^s\right) + \frac{1}{2}\left[\frac{N}{L_1}\,\log_2 \sqrt{e} + \frac{N}{L_2}\,\log_2 \sqrt{e}\right] \tag{10}$$


As division criterion, a reduction of the entropy as a result of the split
event is required, i.e.

$$\hat{H} - \hat{H}_s > C \tag{11}$$

whereby C (with C > 0) is a constant that represents the desired drop of the entropy.
When

$$\frac{L}{2} = L_1 = L_2 \tag{12}$$

is assumed, then deriving as a result thereof is

$$\sum_{n=1}^{N} \log_2 \frac{\sigma_n}{\sigma_n^s} > \frac{N}{L}\,\log_2 \sqrt{e} + 1 + C \tag{13}$$
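
The division criterion of Equation (13) translates directly into a short test. The function name `split_is_worthwhile`, the parameter names and the example values are hypothetical; `min_drop` stands for the constant C.

```python
import numpy as np

def split_is_worthwhile(stds, split_stds, n_observations, min_drop):
    """Equation (13): True if the split lowers the estimated entropy by more than min_drop bits."""
    stds = np.asarray(stds, dtype=float)
    split_stds = np.asarray(split_stds, dtype=float)
    n_dims = stds.size                                        # N, dimension of the feature space
    gain = np.sum(np.log2(stds / split_stds))                 # left side of Equation (13)
    cost = (n_dims / n_observations) * np.log2(np.sqrt(np.e)) + 1.0 + min_drop
    return bool(gain > cost)

# A mode with scatter 2.0 per component, observed 100 times, with C = 0.5.
narrow = split_is_worthwhile([2.0, 2.0], [0.5, 0.5], n_observations=100, min_drop=0.5)
wide = split_is_worthwhile([2.0, 2.0], [1.5, 1.5], n_observations=100, min_drop=0.5)
print(narrow, wide)
```

In the first call the new modes are much narrower than the original mode, so the criterion is met; in the second call the scatter shrinks too little to pay for the additional bit and the constant C.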



One possibility of determining the mid-points of the two new modes is
disclosed below. A preferred default is comprised in meeting the criterion for the
splitting. In the recited example, the value of $\hat{\mu}$ is allocated to $\hat{\mu}_1^s$. $\hat{\mu}_2^s$ receives a
maximum likelihood estimate of those observations that are imaged onto $\hat{\mu}$ in the
Viterbi path. These stipulations merely reveal one possibility without any intent of a
limitation of the disclosed method to this possibility.



The following steps of the exemplary application show the embedding
into an arrangement for speech recognition or, respectively, a method for speech
recognition.

Step 1: Initialization: $\hat{\mu}_1^s = \hat{\mu}$, $\hat{\mu}_2^s = \hat{\mu}$;

Step 2: Recognizing the expression, analyzing the Viterbi path;

Step 3: For every status and for every mode of the Viterbi path:

Step 3.1: define $\sigma_n$;

Step 3.2: define $L_2$ on the basis of those observations that lie closer
to $\hat{\mu}_2^s$ than to $\hat{\mu}_1^s$ and set $L = L_2$. If $\hat{\mu}_2^s$ and $\hat{\mu}_1^s$ are identical,
then assign the second half of the feature vectors to $\hat{\mu}_2^s$ and the
first half of the feature vectors to $\hat{\mu}_1^s$;

Step 3.3: correspondingly define $\sigma_n^s$ on the basis of the $L_2$
expressions;

Step 3.4: re-determine $\hat{\mu}_2^s$ on the basis of the average of those
observations that lie closer to $\hat{\mu}_2^s$ than to $\hat{\mu}_1^s$;

Step 3.5: interpret division criterion according to Equation (13);

Step 3.6: if division criterion according to Equation (13) is positive,
generate two new modes with the centers $\hat{\mu}_1^s$ and $\hat{\mu}_2^s$.

Step 4: Go to step 2.
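
Steps 3.1 through 3.5 can be sketched for a single mode as follows. The sketch assumes that the observations imaged onto the mode by the Viterbi path are already collected in time order in one array, and it uses the Euclidean distance to decide which center an observation lies closer to; both are assumptions, as are the function name `try_split` and the synthetic data. Generating the two new modes (step 3.6) and returning to the recognition (step 4) are left to the caller.

```python
import numpy as np

def try_split(observations, mean_1, mean_2, min_drop):
    """Evaluate steps 3.1 to 3.5 for one mode; returns (split?, mean_1, mean_2, split_std)."""
    obs = np.asarray(observations, dtype=float)
    n_dims = obs.shape[1]
    sigma = obs.std(axis=0)                                   # step 3.1
    if np.allclose(mean_1, mean_2):                           # step 3.2, identical centers:
        closer_to_2 = np.arange(len(obs)) >= len(obs) // 2    # second half goes to center 2
    else:
        d1 = np.linalg.norm(obs - mean_1, axis=1)
        d2 = np.linalg.norm(obs - mean_2, axis=1)
        closer_to_2 = d2 < d1
    group_2 = obs[closer_to_2]
    n_obs = len(group_2)                                      # L is set to L_2 as in step 3.2
    if n_obs < 2:
        return False, mean_1, mean_2, sigma
    split_sigma = group_2.std(axis=0)                         # step 3.3
    mean_2 = group_2.mean(axis=0)                             # step 3.4
    if np.any(sigma <= 0.0) or np.any(split_sigma <= 0.0):
        return False, mean_1, mean_2, sigma
    gain = np.sum(np.log2(sigma / split_sigma))               # step 3.5, Equation (13)
    cost = (n_dims / n_obs) * np.log2(np.sqrt(np.e)) + 1.0 + min_drop
    return bool(gain > cost), mean_1, mean_2, split_sigma

# Synthetic observations: two well separated groups imaged onto one mode.
rng = np.random.default_rng(0)
obs = np.vstack([rng.normal(-3.0, 0.5, (200, 2)), rng.normal(3.0, 0.5, (200, 2))])
start = obs.mean(axis=0)                                      # step 1: both centers equal
do_split, center_1, center_2, split_std = try_split(obs, start, start, min_drop=0.5)
print(do_split, center_2)
```

With these synthetic observations the second center moves to the second group, whose scatter is far smaller than that of the original mode, so the criterion is met.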